ik_llama.cpp uses the CPU as its base compute device. “Offloading” means sending specific tensors and operations to the GPU for processing. Because GPUs have higher memory bandwidth and more parallel compute than CPU+RAM, the goal is to offload as much as possible to maximize tokens/second.
## Core offload parameters
### `-ngl` / `--gpu-layers`
Offload the first N transformer layers to VRAM. Pass `999` to offload everything:
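A minimal sketch (binary and model paths are illustrative):

```bash
# Offload all layers: 999 exceeds the layer count of any current model,
# so every transformer layer ends up in VRAM
./llama-server -m model.gguf -ngl 999
```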
### `-ot` / `--override-tensor`
Override where individual tensors are stored using regular expressions. This is the most powerful offload control available, particularly useful for MoE models where you want experts in RAM and everything else in VRAM. The argument has the form `regex=device`: the part before `=` is a regex matched against tensor names, and the value after `=` is the target device (`CPU`, `CUDA0`, `CUDA1`, etc.).
Tensor names follow the pattern `blk.N.tensor_name`. Run `gguf_dump.py` on your model to list all tensor names and identify the right regex pattern. For example:
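A sketch that keeps MoE expert tensors in RAM while the rest of the model goes to the GPU; the `ffn_.*_exps` pattern assumes the common expert tensor naming, so verify it against your model's `gguf_dump.py` output:

```bash
# Match ffn_up_exps / ffn_gate_exps / ffn_down_exps in every blk.N and pin them to RAM;
# all remaining tensors follow -ngl onto the GPU
./llama-server -m model.gguf -ngl 999 -ot "ffn_.*_exps=CPU"
```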
### `--fit` / `--fit-margin`

Automatically load as many tensors as available VRAM permits, without specifying an explicit layer count (a usage sketch follows the table):

| Parameter | Default | Notes |
|---|---|---|
| `--fit` | off | Automatically fills VRAM. Cannot be combined with `--cpu-moe`, `--n-cpu-moe`, or `-ot`. |
| `--fit-margin N` | 1024 MiB | Increase if you get CUDA OOM during model load. Decrease if too much VRAM is left unused. |
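A minimal sketch, assuming the margin is given in MiB as in the table above:

```bash
# Fill VRAM automatically, but leave a 2048 MiB safety margin for the CUDA runtime
./llama-server -m model.gguf --fit --fit-margin 2048
```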
## Multi-GPU configuration
### Single GPU

For a single GPU, use `-ngl 999` to fully offload, or a lower number for partial offload:
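A sketch of both variants (binary and model paths are illustrative):

```bash
# Full offload: every layer in VRAM
./llama-server -m model.gguf -ngl 999

# Partial offload: only the first 20 layers go to VRAM, the rest run on the CPU
./llama-server -m model.gguf -ngl 20
```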
### Multi-GPU

Use `-mg` to select which GPU to use when multiple are present but you only want one:
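A sketch selecting the second device; the index-based numbering is an assumption, so check your CUDA device order:

```bash
# Run fully offloaded, but on GPU 1 instead of the default GPU 0
./llama-server -m model.gguf -ngl 999 -mg 1
```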
## MoE-specific offload options

For Mixture-of-Experts models, ik_llama.cpp provides dedicated parameters to control where expert weights live (a usage sketch follows the table):

| Parameter | Description |
|---|---|
| `--cpu-moe` | Keep all MoE expert weights in RAM. Simple one-flag hybrid setup. |
| `--n-cpu-moe N` | Keep MoE weights of the first N layers in RAM. Useful when some VRAM is available. |
| `-ooae` / `--offload-only-active-experts` | When expert weights are in RAM, only copy the activated experts to VRAM for computation (reduces RAM→VRAM transfer). Default: on. |
| `-no-ooae` | Disable active-expert-only offload. May help when nearly all experts are activated (large batches). |
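A typical hybrid sketch: nominally offload everything, but keep the experts in RAM:

```bash
# Attention, norms, and shared weights go to VRAM; all expert weights stay in RAM
./llama-server -m moe-model.gguf -ngl 999 --cpu-moe
```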
## Per-operation offload control
`-op` / `--offload-policy` gives fine-grained control over which GGML operations run on the GPU:
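The exact value syntax is not shown in this section; as a hedged sketch, assuming the policy is given as comma-separated integer pairs of (GGML op, on/off) with `-1` meaning all ops:

```bash
# Assumed syntax: op,flag pairs; -1,0 would keep every op on the CPU
./llama-server -m model.gguf -ngl 999 -op -1,0
```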
## CUDA fine-tuning
`-cuda` / `--cuda-params` accepts a comma-separated list of CUDA-specific tuning options, including fusion control, GPU offload threshold, and MMQ-ID threshold:
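A hypothetical invocation to illustrate the comma-separated format only; the key names below are placeholders, not confirmed option names (check `--help` for the real ones):

```bash
# "fusion" and "offload-thresh" are hypothetical keys shown only for format illustration
./llama-server -m model.gguf -cuda fusion=0,offload-thresh=32
```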
## Practical examples
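A hedged end-to-end sketch combining the flags above (paths and layer counts are placeholders):

```bash
# Large MoE model that doesn't fit in VRAM: experts of the first 40 layers stay in RAM
./llama-server -m /models/moe-model.gguf -ngl 999 --n-cpu-moe 40

# Dense model on one GPU: let --fit pack as many tensors as the margin allows
./llama-server -m /models/dense-model.gguf --fit --fit-margin 1024
```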
## Related pages
- Hybrid CPU/GPU inference — Detailed guide for running models that don’t fit in VRAM
- Parameters reference — Full GPU offload parameter reference